The Boyer-Moore String Search Algorithm...........Bob Bernard
                                                 Westport, CT

For years now, I have been working on a debugger for the Apple.  Lately I have been adding a hex string search capability to it.

I needed one so I could look through the Apple IIc (ProDOS) utilities to see how they squirrel away, in the alternate page screen holes, the user-specified default settings for the serial ports.  These settings are used at PR#1 or PR#2 time to simulate the DIP switches on the Super Serial Card in a IIe.  Without setting them you always get 9600 bps, etc.  (Imagewriter settings, that is).  I (and I assume other AAL readers) want a little routine for the DOS 3.3 HELLO program that will put away the user's defaults the same way the IIc utility does.

Well, that routine is not ready yet.  However, the search utility is rather interesting in its own right.

I was just going to code up a straight hex search, but then I mentioned it to my computer science graduate son, David.  He was horrified that I would waste my time on anything so crude.  That's what I get for bringing up a programmer!  David insisted that I should instead code an implementation of Boyer and Moore's algorithm, which appeared in the October 1977 issue of the Communications of the ACM.  [A more recent reference is in the book "Algorithms", by Robert Sedgewick, (Addison-Wesley Publishing Co., 1983, 551 pages) on pages 249-252.]

Well, I read the article and it seemed like a challenge.  Besides, it looked like a real time saver, and it could also be used for character string searches.  The code here has been excerpted from my debugger, and then worked over by Bob S-C.

The "conventional" or "brute-force" search technique aligns the search pattern with the left end of the string to be searched through and compares one byte at a time, from left to right, until either the entire pattern is compared successfully or a mismatch occurs.  In the latter case the search window is moved one byte to the right, and the comparing process is repeated. 
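
For comparison, here is a minimal Python sketch of that conventional search.  It is not part of the assembly listing; the function name brute_force_search and the sample text are chosen only for illustration.

```python
def brute_force_search(text, key):
    """Return the start offset of every complete match of key in text."""
    matches = []
    last_start = len(text) - len(key)      # last offset where the key still fits
    for start in range(last_start + 1):
        i = 0
        # Compare left to right until the key is used up or a byte differs.
        while i < len(key) and text[start + i] == key[i]:
            i += 1
        if i == len(key):
            matches.append(start)
        # Match or mismatch, the window moves exactly one byte to the right.
    return matches


text = b"THERE ARE FEW ST. BERNARDS IN SAN BERNARDINO."
print(brute_force_search(text, b"BERNARD"))    # [18, 34]
```

Note that the window is tried at every one of the possible starting offsets, no matter what the key looks like.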

Without any knowledge about the contents of the search pattern, the most the window can be moved is one place to the right.  Boyer-Moore owes its speed advantage to the fact that it uses context (i.e.  knowledge about the contents of the pattern to be searched for) to increase the distance that the search window can be advanced when a mismatch occurs.  Thus efficiency increases as the length of the pattern increases, which does not happen in a conventional search. 

The cost of this benefit (there always is a cost) is that a table (called DELTA1 in the CACM article and DELTA.TABLE in my program) is required to store this context information, 256 bytes in this implementation.  One byte is needed in the table for every possible value of the characters in the string to be searched.

If a particular byte appears in the search pattern, then the corresponding DELTA table entry contains the distance that the rightmost occurrence of that byte is from the left end of the pattern.  All other entries contain the value -1.  When a mismatch occurs, the DELTA table entry corresponding to that byte from the text being searched is used to compute how far to advance the search window.  If that byte does not appear anywhere in the pattern, then the search window can be advanced by the full length of the pattern.

Moving the search window, and the associated test for completion, take most of the time in any searching technique.  Saving time here can therefore be extremely beneficial, which is why Boyer and Moore should be complimented.

My program uses the control-Y monitor command, in the form

     adr1.adr2^Y <hexstring>

The two addresses specify the start and end of the area to be searched.  "^Y" stands for "control-Y".  The hex string may be separated from the control-Y by one or more spaces, if you desire.  Since the control-Y doesn't show up on the screen, I usually type at least one space before the hex string.  The hex string itself is a continuous string of hex digits, with no imbedded spaces.  Here is an example that will search from $800 to $BFFF for "BERNARD":

     800.BFFF^Y 4245524E415244

The program will list the starting addresses of any and all complete matches that are found.

The maximum length of the hex string is limited by the monitor input buffer.  Since the longest command you can type is less than 256 characters, and you have to use around ten characters for the addresses and control-Y, that puts an upper limit of less than 246 hex digits in your command.  Each byte of the search pattern (or "key") is made up of two hex digits, so the longest key will be less than 123 bytes long.

I assigned DELTA.TABLE to the area $02D0.03CF.  Since I scan and collect the search pattern right in the monitor keyboard buffer at $0200, after converting to hex bytes it will run no higher than $027F.

Actually, I only implemented a simplified version of Boyer and Moore's procedure.  The CACM article also discusses a second table, DELTA2, which is filled with additional context information regarding "terminating substrings" of the search pattern.  In cases where a partial mismatch occurs, it may be possible to advance the search window farther than the DELTA1 table would indicate.  However, since such situations occur in fewer than 20% of the cases, David allowed that the potential additional speed did not justify the time and effort and the additional table and code space that would have been required, and he gave me a passing grade on my effort without it.  The incorporation of this additional capability, and changes to make the program an ASCII search, are left "as an exercise for the reader."

My program must go through several steps.  First it has to find and pack up the search key.  Next it must build the DELTA table.  And finally the search can be performed.

Lines 1290-1360 will be executed when you BRUN the program.  They install the control-Y vector and jump into the monitor, just as though you entered with CALL-151.

When you enter the search command, the Apple monitor parses the command line up to and including the control-Y, and then branches to my code at line 1380.  The two addresses will have been converted and stuffed into A1 ($3C,3D) and A2 ($3E,3F).  A variable named YSAV (at $34) contains the index to the next character following the control-Y.

Lines 1400-1440 skip over any blanks you may have typed between the control-Y and the first hex digit.  Actually, the Y-register gets incremented once too often, so lines 1460-1470 decrement Y and save it; now YSAV points to the first hex digit in the search key.

The next problem I had to solve was to differentiate odd from even length strings and arrange them properly, adding a leading zero when an odd number of hex digits is input.  Lines 1490-1530 search for the end of the hex string; if there are no digits at all, we are finished and line 1530 returns for the next monitor command.

This is a nice place to insert a brief description of the NXTCHAR subroutine, found in lines 2460-2590.  NXTCHAR picks up the next character from the input buffer, and tests to see if it is a hex digit.  If so, it returns either $00-09 or $FA-FF in the A-register, and carry will be clear.  If not a hex digit, it returns with carry set.  If we got a digit, the Y-register indexing the input buffer will have been advanced.
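
The two odd-looking ranges fall out naturally from a little 8-bit arithmetic.  The Python sketch below models one way to get them, mirroring the EOR #$B0 / ADC #$88 sequence used by the Apple monitor's own hex scanner.  Whether NXTCHAR does it exactly this way is an assumption (the listing is the authority), and the name hex_digit is hypothetical.  Input characters are assumed to have the high bit set, as they do in the monitor input buffer.

```python
def hex_digit(ch):
    """Classify one input-buffer byte.  Returns (value, is_digit).

    Assumption: the character has its high bit set, so "0"-"9" are
    $B0-$B9 and "A"-"F" are $C1-$C6.
    """
    a = ch ^ 0xB0                  # "0"-"9" become $00-$09
    if a < 0x0A:
        return a, True
    a = (a + 0x89) & 0xFF          # ADC #$88 with carry set; "A"-"F" become $FA-$FF
    return a, a >= 0xFA            # anything else lands below $FA


for ch in (0xB7, 0xC1, 0xC6, 0xC7, 0x8D):
    value, ok = hex_digit(ch)
    print(f"${ch:02X} -> ${value:02X} digit={ok}")
```

The low four bits of $FA-$FF are already the digit values $A-$F, so the packing code only has to keep the low nibble.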

Lines 1550-1590 compute the key length.  Since two digits make a byte, the number of digits in the hex string divided by two gives the number of bytes.  But I actually want to use the byte-count-minus-one.  Also I need to adjust for odd or even length strings.  Lines 1600-1650 take care of these details.  If the count was odd, I jump into the middle of the packing loop so that a leading zero gets inserted.

Lines 1670-1800 comprise the packing loop.  NXTCHAR will return with carry set when we try to get a digit beyond the end of the key, so line 1680 is the only test in the loop.  Lines 1670-1730 retrieve a left-hand digit and store it in the buffer.  Lines 1740-1800 do the same for right-hand digits.  Key bytes are stored starting at $0200, so they never catch up to the advancing retrieval of digits.
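
The effect of the length calculation and the packing loop can be sketched in a few lines of Python.  This is only an illustration of the result, not a translation of the assembly code; pack_hex is a made-up name, and the sketch builds a new byte string rather than packing in place in the input buffer.

```python
def pack_hex(digits):
    """Pack a string of hex digits into bytes, two digits per byte.

    An odd number of digits gets a leading zero, so "123" packs as 01 23.
    """
    values = [int(d, 16) for d in digits]
    if len(values) % 2:            # odd count: insert a leading zero digit
        values.insert(0, 0)
    packed = bytearray()
    for i in range(0, len(values), 2):
        # Left-hand digit goes in the high nibble, right-hand digit in the low.
        packed.append((values[i] << 4) | values[i + 1])
    return bytes(packed)


print(pack_hex("4245524E415244"))      # b'BERNARD'
print(pack_hex("123").hex())           # 0123
print(len(pack_hex("123")) - 1)        # 1  (byte-count-minus-one)
```

Each pair of digits produces one byte, so the packed key is always shorter than the text it was typed as.  That is why packing in place is safe in the real program.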

Line 1810 sets YSAV to point to the first character past the end of the hex string.  This will usually be a carriage return, or another monitor command.  Unless it is beyond $2CF, the monitor will correctly continue parsing whatever is in the buffer when we are through searching.  At $2D0 and beyond, the DELTA table will clobber any further characters.

Now we come to the Boyer-Moore part.  Lines 1820-1870 initialize the DELTA table to all -1 values, which is what we want for any bytes not present in the key.  When the loop finishes, X=0 again.

Lines 1880-1970 scan through the search key from left to right, and store into DELTA the index of the rightmost occurrence of each value in the key.  For example, if the key is "4245524E415244" ("BERNARD" again), the DELTA values will be:

     DELTA+$41:  4
     DELTA+$42:  0
     DELTA+$44:  6
     DELTA+$45:  1
     DELTA+$4E:  3
     DELTA+$52:  5 (also at 2, but 5 is rightmost)
    all others:  -1
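
The table above can be reproduced with a short Python sketch of the same two steps: fill with -1, then scan the key left to right so that later (rightmost) occurrences overwrite earlier ones.  The name build_delta is hypothetical; the assembly version works on the 256-byte DELTA.TABLE instead of a Python list.

```python
def build_delta(key):
    """Return a 256-entry table: index of the rightmost occurrence of each
    byte value in key, or -1 for byte values that do not occur."""
    delta = [-1] * 256
    for index, byte in enumerate(key):     # left to right: rightmost wins
        delta[byte] = index
    return delta


delta = build_delta(b"BERNARD")
for ch in "ABDENR":
    print(f"DELTA+${ord(ch):02X}:  {delta[ord(ch)]}")
print("blank:", delta[ord(" ")])           # -1, not in the key
```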

We'll continue with this example after a brief look at the rest of the code.

Lines 1980-2040 back up the end pointer, which has been patiently waiting all this time in A2L and A2H.  We subtract the key length (in bytes, not digits) from the end pointer, so that we will not try to match the key any further than necessary.  We could do this inside the search loop, but it will run faster if we do it once before the loop.

Lines 2050-2440 perform the search.  I inserted lines 2070-2110 inside the loop to print out the search window start address each time through the loop.  This helps me to make sure it is working, and to explain how.  Of course you should remove these five lines before using the routine for real problems.  Notice they are all marked "<<<DEBUG>>>".

Lines 2120-2170 check whether the beginning of the search window has moved past the end of the area to be searched.  If so, we are finished.

Lines 2180-2240 compare bytes from the key and the search window.  If the entire key matches, we fall out of the loop into lines 2250-2300, where the address of the match will be printed.  After a successful match the search window will be moved one byte to the right by lines 2370-2430, and we will begin the SEARCH.LOOP again.

Notice that the key is compared from right-to-left, not left-to-right.  This is a critical part of the Boyer-Moore method.  If a key byte does not match a search-window byte, we branch to line 2320.  The byte from the search window is in the A-register.  Lines 2320-2370 compute how far we can advance the search window, based on just what character we DID find in the search window, and how far into the key we had already matched.
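
In Python terms the advance is just the current key index minus the DELTA entry for the byte that was found.  The sketch below adds one guard that is an assumption here, not something taken from the listing: if the rightmost occurrence of the byte lies at or to the right of the mismatch position, the difference is zero or negative, so the window is moved one byte instead.  The name window_advance is hypothetical.

```python
def window_advance(key_index, delta_value):
    """How far to move the window after a mismatch at key_index, given the
    DELTA entry for the text byte that was found there."""
    shift = key_index - delta_value
    # Assumed guard: never stand still or move backward.
    return shift if shift > 0 else 1


print(window_advance(6, 4))     # 2: byte occurs at index 4 of the key
print(window_advance(6, -1))    # 7: byte is not in the key, skip the whole key
print(window_advance(3, 5))     # 1: rightmost occurrence is to the right
```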

To see how this works, let's continue the "BERNARD" example.  Suppose the text we are searching is "THERE ARE FEW ST. BERNARDS IN SAN BERNARDINO."  The key will be BERNARD, entered in hex as shown above.  We first try to match BERNARD at the beginning of the text.  We start at the right end, matching the "D" of the key with "A" of the text.  The match fails, so we look up the "A" value in the DELTA table, which is 4.  We subtract the delta value (4) from the current key index (6) and add the result (6-4=2) to the search window address.  Note that this has the result of aligning the "A" of BERNARD with the "A" in the text.

Back to the top, and we now try to match the "D" of BERNARD to the "E" at the end of "ARE".  Failure again!  This time the DELTA value is 1, and we are still at position 6 in the key:  index-delta is 5, so we advance the window by 5.  This lines up the "E" of BERNARD with the E of the text.  The next attempted match will find a blank in the text, which does not occur in the key at all.  The delta value for blank is -1:  6-(-1)=7, so we will advance the window by 7.  Now the window is up to "ST. BER" in the text.

When we compare "D" of BERNARD to "R" in the text, we fail again.  The delta value for R is 5.  There are two R's in BERNARD, but the rightmost one is at index 5.  We can move the search window by 6-5=1.  Next we try "D" against "N".  The delta value of "N" is 3, so we can move the window 6-3=3 bytes.  This time we have found "BERNARD"!

If you count it all up, we have compared the "D" of BERNARD with only six characters, and already we are at the first occurrence of the whole key in the text.  A conventional search would have tried to match the first character of the key ("B") with all 18 characters in the text which precede the first "B" of the text.  We have saved 13 times around the main loop!  Of course, our loop is a tiny bit longer, but the end result is faster.

Here is a step-by-step picture of the entire search, which finds BERNARD twice:
 
     THERE ARE FEW ST. BERNARDS IN SAN BERNARDINO.
     BERNARD
       BERNARD
            BERNARD
                   BERNARD
                    BERNARD
                       BERNARD  (success!)
                        BERNARD
                               BERNARD
                                  BERNARD
                                       BERNARD  (success!)
                                        BERNARD
                                               BER... (end)
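
The picture can be checked with a compact Python sketch of the simplified (DELTA1-only) search.  It is an illustration under stated assumptions rather than a translation of the listing: the name search is made up, offsets are counted from the start of the text instead of being printed as addresses, and the "advance at least one byte" guard is an assumption.

```python
def search(text, key):
    """Simplified Boyer-Moore search using only the DELTA1-style table.
    Returns (window offsets tried, offsets of complete matches)."""
    delta = [-1] * 256
    for index, byte in enumerate(key):     # rightmost occurrence wins
        delta[byte] = index
    windows, matches = [], []
    start = 0
    while start <= len(text) - len(key):   # stop when the key no longer fits
        windows.append(start)
        i = len(key) - 1
        # Compare right to left.
        while i >= 0 and text[start + i] == key[i]:
            i -= 1
        if i < 0:
            matches.append(start)
            start += 1                     # after a match, move one byte
        else:
            shift = i - delta[text[start + i]]
            start += shift if shift > 0 else 1   # assumed guard
    return windows, matches


text = b"THERE ARE FEW ST. BERNARDS IN SAN BERNARDINO."
windows, matches = search(text, b"BERNARD")
print(windows)    # [0, 2, 7, 14, 15, 18, 19, 26, 29, 34, 35]
print(matches)    # [18, 34]
```

The eleven offsets correspond to the first eleven lines of the picture.  The twelfth line, "BER... (end)", is the window at offset 42, where the key no longer fits and the loop ends.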

I have tacked two more examples onto the end of the source code, at lines 2620-2690.  You can play with them.  The five <<<DEBUG>>> lines will print out the window address at each step, so you can see how the search progresses.  Remember to take those lines out before you make a production version of the program.

If you decide to include this search algorithm in your own private debugger program, like I am, you might want to add the ability to use an ASCII string for the key.  You could use a quotation mark after the control-Y to signal the packer loop that an ASCII string follows.  You might also want to add single-byte wildcard characters, and/or the ability to ignore the high-order bit of each byte matched.

Perhaps the Boyer-Moore algorithm would be even more useful in a data base program, a word processor, or other context in which you are searching through huge quantities of text for relatively interesting keys.  My example should get you started, and my son will be proud of you!
1
